This course seems like a lot of fun.
You can find my preprocessed data set and the data wrangling R script in my GitHub repository.
The learning2014 dataset was collected between 3 December 2014 and 10 January 2015 as part of an international survey of approaches to learning (you can find more information on it here). It contains 166 observations on 7 variables.
learning2014 <- read.table("learning2014.csv", header = TRUE, sep = "\t")
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The variables are gender, age, attitude, deep, stra, surf and points.
| Variable | Definition | Value |
|---|---|---|
| age | age (in years) derived from the date of birth | integer |
| gender | gender | M (Male), F (Female) |
| attitude | global attitude toward statistics | 1 to 5 on the Likert scale |
| deep | mean of the questions related to deep learning | numeric (Likert-scale average) |
| stra | mean of the questions related to strategic learning | numeric (Likert-scale average) |
| surf | mean of the questions related to surface learning | numeric (Likert-scale average) |
| points | exam points | integer |
It seems like a fairly plausible assumption that a student’s attitude towards statistics has a strong impact on exam success. We can start by plotting the variables attitude and points as a two-way scatter plot, with a colour indicator for gender and fitted regression lines. (I use the option include=FALSE when loading the libraries to keep R’s warning messages out of the output.)
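For reference, the plotting code below assumes the required libraries have already been loaded in such a hidden chunk. A minimal sketch of that setup chunk:

```r
# Setup chunk (run with include=FALSE in the .Rmd so that the
# startup messages do not clutter the output):
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
library(GGally)   # ggpairs()
```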
# initialize plot with data and aesthetic mapping, define the visualization type (points), add a regression line and add a main title:
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender)) + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's attitude versus exam points")
# draw the plot
p1
Next let’s consider a more detailed plot describing the variables, their distributions and relationships. This plot packs in a wealth of information! First, we can in fact see that the hypothesis on the relation between attitude and exam points was plausible, as the correlation between points and attitude is indeed stronger than between points and any other variable.
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
From the summary of the data we can see the minimum, maximum, median and mean values as well as the 1st and 3rd quartiles of all the variables. One can note that the sample is biased towards female respondents, that the respondents are mostly in their early twenties, and that they have a generally more positive than negative view of statistics as a discipline. The median and mean values of the learning variables deep, stra and surf show that the respondents are more likely to engage in deep learning than in surface learning.
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Next let’s see if some of the hypotheses made above hold up against a somewhat more sophisticated analysis. We will consider exam points as the dependent variable, and the aim is to fit a regression model that explains what affects the number of exam points.
From the plots above, we can deduce that attitude, stra and surf have the highest correlations with points in absolute terms. Let’s thus choose these as our explanatory variables and fit the following linear model with multiple explanatory variables:
# create a regression model with multiple explanatory variables
points_mmodel <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(points_mmodel)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the summary of the model we can see the results. Under “Call” is a recap of the model. Under “Residuals” the summary prints important information on the residuals. Recall that the residuals of a linear regression model are assumed to be normally distributed with zero mean and constant variance. We can see that the median is quite close to zero, but we should conduct further statistical tests before concluding that this assumption holds.
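One possible further check, sketched here as an illustration (not part of the original analysis), is the Shapiro-Wilk test applied to the residuals of the fitted model:

```r
# Shapiro-Wilk test: the null hypothesis is that the residuals are
# normally distributed, so a large p-value supports the normality
# assumption of the linear model.
shapiro.test(residuals(points_mmodel))
```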
Under “Coefficients” we have the estimated effects of the explanatory variables on the dependent variable, their standard errors, t-values and p-values. The last two are perhaps the most important, as they indicate the statistical significance of the estimates. From the printout we can immediately see that only the intercept and the effect of attitude are statistically significantly different from zero. The effects of stra and surf are not, so these variables should be removed from the model.
Let’s then fit the model again without the two statistically insignificant variables, that is, as a simple linear regression:
# create a regression model with a single explanatory variable
points_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(points_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now we can see that the median of the residuals is closer to zero than in the multiple regression, which is a good sign. We can also see that both the intercept and the effect of attitude are now highly statistically significant, whereas before, when the other two variables were included, the intercept was slightly less significant.
From the summary of our simple linear regression model we can see that the estimated effect of attitude on points is 3.5255. This means that a one-unit change in attitude (holding all other variables constant) corresponds to a 3.5255-point increase in exam points. More concretely, a student with an attitude score of 4 is expected to score 3.5255 points more on the exam than a student with an attitude score of 3. This result clearly supports the hypothesis made in part 2 that attitude towards statistics is an important predictor of exam success.
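As a quick sanity check of this interpretation, we can compare the model predictions for the two attitude scores using the points_model fitted above:

```r
# Predicted exam points for attitude scores 3 and 4; the difference
# between the two predictions equals the slope, about 3.5255.
predict(points_model, newdata = data.frame(attitude = c(3, 4)))
```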
The “multiple R-squared” printed as part of the summary gives the coefficient of determination, i.e. the proportion of variance in the dependent variable that is predicted by the model. In our model the intercept is included, which means that R-squared is the square of the sample correlation coefficient between the observed outcomes and the predicted values. The value of R-squared in our model is 0.1906, which means that 19.06 % of the variance of the dependent variable is explained by our model. The remaining variance is explained by other factors that we have not included in the model.
Notice that R-squared is higher in the multiple regression model, even though the additional explanatory variables are insignificant. This is because R-squared automatically increases when explanatory variables are added to a model, relevant or not. To account for this, R also prints the “adjusted R-squared”, which adjusts for the number of explanatory variables relative to the number of data points.
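The adjustment can be verified by hand from the numbers in the first model’s summary (R-squared 0.2074, n = 166 observations, p = 3 explanatory variables):

```r
# adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.2074; n <- 166; p <- 3
1 - (1 - r2) * (n - 1) / (n - p - 1)  # approx 0.1927, as in the summary
```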
When using a linear regression model, we make two important assumptions about the data-generating process:

- the target variable is a linear combination of the model parameters
- the errors are normally distributed with zero mean and constant variance
The second assumption also implies that the errors are uncorrelated and that the size of a given error does not depend on the explanatory variables.
We can study the validity of the model with the following three diagnostic plots: residuals vs. fitted values, the normal Q-Q plot, and residuals vs. leverage.
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(points_model, which = c(1,2,5))
The assumption of constant error variance can be assessed with a simple scatter plot of residuals vs. model predictions (fitted values). Any pattern in this plot indicates a problem. Our plot shows no obvious pattern, so we can accept the constant-variance assumption with reasonable confidence.
The normal Q-Q plot allows us to assess the normality of the errors: the plotted values should lie reasonably close to the straight line. We can see from the plot that this is indeed the case.
From the residuals vs. leverage plot we can study the impact single observations have on the model. In the third plot, no single observation appears to stand out.
Thus with the help of these diagnostic plots we can conclude that the assumptions of the linear regression model are valid.
You can find my preprocessed data set and the data wrangling R script in my GitHub repository.
The alc data set contains data on student alcohol consumption. It is a joined data set, combining two of the data sets included in the Student Performance Data. You can learn more about the data here.
The data contains the combined results of two surveys, conducted on students of mathematics and Portuguese classes. The two data sets were merged in the data wrangling part of this exercise, and the variables not used for joining were combined by averaging. Because we are interested in studying the alcohol use of students, we have defined two new variables: alc_use, the average of workday (Dalc) and weekend (Walc) alcohol consumption, and high_use, which is TRUE when alc_use is greater than 2.
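As a sketch of how the two variables were created in the data wrangling script (assuming dplyr is loaded; the exact code is in the GitHub repository):

```r
# alc_use: average of workday (Dalc) and weekend (Walc) alcohol use
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
# high_use: TRUE when the average consumption is higher than 2
alc <- mutate(alc, high_use = alc_use > 2)
```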
The data contains 382 observations on 35 variables. Some of the values are numerical, some dichotomous, but there are also variables with different structures.
alc <- read.table("alc.csv", header = TRUE, sep = ";")
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
Now we will present a few hypotheses about possible links between variables in the data and alcohol consumption. First, as in the DataCamp exercises, it seems plausible to assume that sex, absences and failures are correlated with alcohol consumption. As a fourth candidate we will choose going out (goout), the idea being that students who go out partying a lot also end up consuming more alcohol.
Let’s formally state the hypotheses:

- H1: male students consume more alcohol than female students
- H2: more absences are associated with higher alcohol consumption
- H3: more class failures are associated with higher alcohol consumption
- H4: going out more is associated with higher alcohol consumption

All four hypotheses posit a positive association with alcohol consumption.
Now we want to get some idea of how the chosen variables are distributed and what their relationship with alcohol consumption could be.
Let’s start by taking a look at the distributions of the chosen variables and the two variables of interest. From the bar plots it is easy to see that in our sample alcohol consumption is strongly skewed towards low consumption, and that the number of students who use a lot of alcohol is less than half the number of moderate users. From the plots we can also see that there are slightly more female than male respondents, that absences are strongly skewed towards little or no absences with a few outliers, that most students pass their courses, and that going out is the only variable whose distribution even remotely resembles a normal distribution.
ggplot(data = alc, aes(x = alc_use)) + geom_bar()
ggplot(data = alc, aes(x = high_use)) + geom_bar()
ggplot(data = alc, aes(x = sex)) + geom_bar()
ggplot(data = alc, aes(x = absences)) + geom_bar()
ggplot(data = alc, aes(x = failures)) + geom_bar()
ggplot(data = alc, aes(x = goout)) + geom_bar()
We can also easily draw a plot that emphasizes the differences in alcohol consumption between male and female students. The plot offers support for our first hypothesis.
g2 <- ggplot(data = alc, aes(x = high_use))
g2 + geom_bar() + facet_wrap("sex")
First, we ignore gender and cross-tabulate alcohol consumption against the means of the other three variables. We can see that the means of the other three variables appear to show an upward-sloping trend, which is indeed what we would expect on the basis of our hypotheses.
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 9 x 5
## alc_use count mean_absences mean_failures mean_goout
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.0 140 3.357143 0.1071429 2.750000
## 2 1.5 69 4.231884 0.1449275 2.869565
## 3 2.0 59 3.915254 0.2203390 3.084746
## 4 2.5 44 6.431818 0.1363636 3.295455
## 5 3.0 32 6.093750 0.5625000 4.000000
## 6 3.5 17 5.647059 0.4705882 3.764706
## 7 4.0 9 6.000000 0.2222222 4.222222
## 8 4.5 3 12.000000 0.0000000 3.666667
## 9 5.0 9 6.888889 0.5555556 4.222222
If we include sex in the grouping variables, the trends in the means become somewhat more pronounced.
alc %>% group_by(alc_use, sex) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 17 x 6
## # Groups: alc_use [?]
## alc_use sex count mean_absences mean_failures mean_goout
## <dbl> <fctr> <int> <dbl> <dbl> <dbl>
## 1 1.0 F 87 3.781609 0.10344828 2.896552
## 2 1.0 M 53 2.660377 0.11320755 2.509434
## 3 1.5 F 42 4.642857 0.04761905 2.857143
## 4 1.5 M 27 3.592593 0.29629630 2.888889
## 5 2.0 F 27 5.000000 0.25925926 3.296296
## 6 2.0 M 32 3.000000 0.18750000 2.906250
## 7 2.5 F 26 6.576923 0.19230769 2.961538
## 8 2.5 M 18 6.222222 0.05555556 3.777778
## 9 3.0 F 11 7.636364 0.54545455 4.090909
## 10 3.0 M 21 5.285714 0.57142857 3.952381
## 11 3.5 F 3 8.000000 0.33333333 3.333333
## 12 3.5 M 14 5.142857 0.50000000 3.857143
## 13 4.0 F 1 3.000000 0.00000000 4.000000
## 14 4.0 M 8 6.375000 0.25000000 4.250000
## 15 4.5 M 3 12.000000 0.00000000 3.666667
## 16 5.0 F 1 3.000000 0.00000000 5.000000
## 17 5.0 M 8 7.375000 0.62500000 4.125000
We can also do the cross-tabulations with “high_use” as the variable of interest. Here the differences in means between students who are heavy users of alcohol and those who are not are quite evident.
alc %>% group_by(high_use, sex) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 4 x 6
## # Groups: high_use [?]
## high_use sex count mean_absences mean_failures mean_goout
## <lgl> <fctr> <int> <dbl> <dbl> <dbl>
## 1 FALSE F 156 4.224359 0.1153846 2.955128
## 2 FALSE M 112 2.982143 0.1785714 2.714286
## 3 TRUE F 42 6.785714 0.2857143 3.357143
## 4 TRUE M 72 6.125000 0.3750000 3.930556
We will next draw box plots with “high_use” as the grouping variable, “sex” as a colour aesthetic, and the other three variables on the y-axis. From the box plots we can see that hypotheses 1, 2 and 4 seem plausible: high alcohol consumption appears to be associated with being male and with relatively more absences and going out. However, the box plot for failures is a bit perplexing. Perhaps because the variable is so heavily skewed towards zero, the plot appears to suggest that only a few outliers differ from zero.
g1 <- ggplot(alc, aes(x = high_use, col = sex, y = absences))
g1 + geom_boxplot() + ylab("absences")
g2 <- ggplot(alc, aes(x = high_use, col = sex, y = failures))
g2 + geom_boxplot() + ylab("failures")
g3 <- ggplot(alc, aes(x = high_use, col = sex, y = goout))
g3 + geom_boxplot() + ylab("goout")
In this part we will use logistic regression to gain more solid evidence on the effects our chosen variables may have on the target variable, the binary high/low alcohol consumption indicator.
# find the model with glm()
m1 <- glm(high_use ~ sex + absences + failures + goout, data = alc, family = "binomial")
From the summary of the model we can immediately see that “failures” is indeed not statistically significant, which is in line with the box plot. Otherwise, the results of the regression are in line with our hypotheses: male students are more likely to be heavy users of alcohol, as are those with more absences and those who go out more.
# print out a summary of the model
summary(m1)
##
## Call:
## glm(formula = high_use ~ sex + absences + failures + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0299 -0.7890 -0.5417 0.7857 2.3588
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.13390 0.47292 -8.741 < 2e-16 ***
## sexM 0.91999 0.25625 3.590 0.000330 ***
## absences 0.08243 0.02235 3.688 0.000226 ***
## failures 0.34560 0.21093 1.638 0.101330
## goout 0.70797 0.11998 5.901 3.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 385.06 on 377 degrees of freedom
## AIC: 395.06
##
## Number of Fisher Scoring iterations: 4
Next we will present and interpret the coefficients of the model as odds ratios and calculate their confidence intervals.
Recall that the odds ratio is defined as the ratio of the odds of Y = TRUE for individuals with property X to the odds of Y = TRUE for individuals without property X. Here Y is high alcohol consumption and X is, e.g., the respondent being male. In the ratios we have computed, the odds ratio for the variable sexM thus gives the ratio of the odds of a male being a heavy drinker to the odds of a female being a heavy drinker. We can see that the ratio is more than 2.5, i.e. the odds of heavy drinking are more than 2.5 times higher for men than for women.
As we are working with a logistic regression, we can interpret the odds ratios for non-binary variables as odds ratios between a unit change vs. no change in the corresponding explanatory variable. Thus the odds ratio of “goout”, approximately 2.0, can be interpreted as the ratio between the odds of being a heavy drinker for someone who goes out one unit more and the odds for someone who does not.
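The connection between the coefficient and odds-ratio printouts is direct: each odds ratio is the exponential of the corresponding logit coefficient. For example, for goout:

```r
# exponentiating the goout coefficient from the model summary
# reproduces its odds ratio of about 2.03
exp(0.70797)
```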
Note also that as all of the odds ratios calculated are higher than one, all our chosen variables are positively associated with high alcohol consumption.
From the confidence intervals we can see that none of the intervals of the statistically significant variables includes one, but the interval for “failures” does. (An odds ratio of one corresponds to no effect.) That is, the effect of this variable on the odds of high alcohol consumption is not statistically different from zero.
# compute odds ratios (OR)
OR1 <- coef(m1) %>% exp
# compute confidence intervals (CI)
CI1 <- confint(m1) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR1, CI1)
## OR1 2.5 % 97.5 %
## (Intercept) 0.01602022 0.006086805 0.03902357
## sexM 2.50925591 1.526894419 4.17886012
## absences 1.08592493 1.040580166 1.13727912
## failures 1.41283378 0.934759890 2.14797583
## goout 2.02986872 1.613471240 2.58531654
In this part, we will explore the predictive power of a model containing only the variables with a statistically significant relationship with high alcohol consumption. Thus we will drop failures and fit a new model:
m2 <- glm(high_use ~ sex + absences + goout, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ sex + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
Now we will use the predict() function to make predictions from this model. We start by adding two columns to our data set: “probability”, with the predicted probabilities, and “prediction”, which takes the value TRUE if “probability” is larger than 0.5.
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, sex, goout, high_use, probability, prediction) %>% tail(10)
## absences sex goout high_use probability prediction
## 373 0 M 2 FALSE 0.14869987 FALSE
## 374 7 M 3 TRUE 0.39514446 FALSE
## 375 1 F 3 FALSE 0.13129452 FALSE
## 376 6 F 3 FALSE 0.18714923 FALSE
## 377 2 F 2 FALSE 0.07342805 FALSE
## 378 2 F 4 FALSE 0.25434555 FALSE
## 379 2 F 2 FALSE 0.07342805 FALSE
## 380 3 F 1 FALSE 0.03989428 FALSE
## 381 4 M 5 TRUE 0.68596604 TRUE
## 382 2 M 1 TRUE 0.09060457 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 65 49
# tabulate the target variable versus the predictions with margins:
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
From the 2x2 cross-tabulation of predictions versus actual values we can see that our model is not perfect: in 15 cases it predicted a respondent to be a heavy user when in fact they were not, and in 65 cases the reverse. On the other hand, in 302 cases the prediction was correct. From the second table we can see the shares of correct and incorrect predictions: the prediction was incorrect in about 21 % of the cases, which seems like a reasonable fit.
We can also see this in a graph that plots the true values against the predictions:
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
We still want to compute the total proportion of inaccurately classified individuals, i.e. the training error. For this purpose we will define a loss function and then calculate the average number of wrong predictions in the training data.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7015707
loss_func(class = alc$high_use, alc$probability)
## [1] 0.2094241
That is, the training error is approximately 0.21, or 21 %, which seems like a reasonable fit.
Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a loss function is computed on data not used for finding the model. Cross-validation gives a good sense of the actual predictive power of the model.
# 10-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2198953
The average proportion of wrong predictions in the cross-validation is about 0.22, which is smaller than the 0.26 error of the model in the DataCamp exercises. Thus my model has slightly better test-set performance than the example model.
The Boston dataset contains data from a study on housing values in suburbs of Boston. With the data you can, for example, study the effect of crime rates or air quality on the value of owner-occupied homes. You can learn more about the data here.
The dataset contains 506 observations on 14 variables. The variables are numeric: most are continuous, some take integer values, and one (chas) is dichotomous.
# load the data
data("Boston")
# glimpse at the dataset
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
Next we will look at the data in a bit more detail. We start by printing out the column/variable names and writing down the definitions of the variables (from here):
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
| Variable | Definition |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
From the summary of the variables we can see minimum, maximum, median and mean values as well as the 1st and 3rd quartiles of the variables.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We can study and visualize the correlations between the different variables with the help of a correlation matrix and a correlation plot.
# First calculate the correlation matrix and round it to two decimal places:
cor_matrix <- cor(Boston) %>% round(digits = 2)
# Print the correlation matrix:
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# Visualize the correlation matrix with a correlations plot:
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
From the plot above we can easily see which variables correlate with each other and whether the correlation is positive or negative. For example, we can see that the distance from the employment centres (dis) correlates negatively with the age of the houses (age), the nitrogen oxides concentration (nox) and the proportion of non-retail business acres (indus). We can also see that accessibility to radial highways (rad) is highly positively correlated with property taxes (tax), as are industrial land use (indus) and air pollution (nox). At first glance the correlations appear quite intuitive and nothing stands out as very surprising. One could note that the correlation of the only dummy variable, chas, with all other variables is almost negligible. This could be because the overall share of observations where chas = 1 is very small.
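To complement the visual inspection, the strongest pairwise correlations can also be listed programmatically. A minimal sketch (the correlation matrix is recomputed here so the chunk is self-contained; it only assumes the MASS package and the Boston data):

```r
# A sketch for listing the strongest pairwise correlations.
library(MASS)
cor_matrix <- round(cor(Boston), digits = 2)
# keep only the upper triangle so that each pair appears once
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
cor_long <- as.data.frame(as.table(cor_matrix))
cor_long <- cor_long[!is.na(cor_long$Freq), ]
# sort by absolute correlation, strongest pairs first
head(cor_long[order(-abs(cor_long$Freq)), ], 5)
```

The top pair should be rad and tax (0.91), matching the matrix printed above.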
We will later use linear discriminant analysis on our data, so we need variables that are normally distributed and share the same covariance matrix across the different classes. This requires the data to be scaled before fitting the model.
Thus we want to standardize the data set by scaling it. This is very easy to do in R with the scale() function, which subtracts the mean of a variable from each value and divides the result by the standard deviation. From the summary we can see that the standardized values are much more similar in magnitude across the different variables, so their relationships are easier to grasp. Note that the mean of each variable is now zero, as it should be. Also note that previously all variables were positive, but now the minimum values are negative for all variables. This is because the mean of each variable is larger than its smallest value.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Note that the scaled data set is now in matrix format. We want to change the class of the scaled data set into a data frame, which we can do with the as.data.frame() function:
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
Next we want to create a categorical (factor) variable of the crime rate in the scaled Boston data set, using the quantiles as the break points. We will print the summary of the scaled crime rate, the quantile vector and a table of the new factor variable to check that everything went as intended.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
In order to avoid confusion, we want to drop the old crime rate variable from the dataset and replace it with the new categorical variable for crime rates. This is easily done with the following two lines of code:
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Lastly for this step, we will divide the dataset into train and test sets. This will allow us further down the line to check how well our model works. We will assign 80 % of the data to the train set and the remaining 20 % to the test set. The model is trained on the train set, and predictions on new data are made on the test set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Now we will fit a linear discriminant analysis on the train set. LDA is a classification method that finds the (linear) combinations of the variables that best separate the target variable classes. We will use the categorical crime rate as the target variable and all the other variables in the dataset as predictors.
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2425743 0.2450495 0.2500000
##
## Group means:
## zn indus chas nox rm
## low 0.8767382 -0.8764588 -0.1237592475 -0.8637179 0.3497135
## med_low -0.1225658 -0.2787896 0.0088923777 -0.5572569 -0.1161571
## med_high -0.3815632 0.2215319 0.1651265141 0.3885314 0.0536119
## high -0.4872402 1.0171306 0.0005392655 1.0916085 -0.4350083
## age dis rad tax ptratio
## low -0.8865399 0.8323319 -0.6893371 -0.7320563 -0.36641543
## med_low -0.3425449 0.3288227 -0.5541258 -0.4863727 -0.03602057
## med_high 0.4241465 -0.3860466 -0.4053176 -0.2989767 -0.24727254
## high 0.8308849 -0.8563886 1.6379981 1.5139626 0.78062517
## black lstat medv
## low 0.3781705 -0.74282691 0.437301853
## med_low 0.3214640 -0.14983795 -0.005120312
## med_high 0.1116508 0.05873945 0.104503881
## high -0.7561471 0.89037517 -0.700084163
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12901244 0.759218071 -1.04210214
## indus 0.02558519 -0.344602583 0.23836152
## chas -0.09277299 -0.040028431 0.13619435
## nox 0.44194526 -0.561553279 -1.39807628
## rm -0.10890240 -0.121942414 -0.12920890
## age 0.23531246 -0.421315123 -0.13624238
## dis -0.09673164 -0.343641692 0.23302831
## rad 3.21647668 0.929019857 -0.09583685
## tax -0.05696458 -0.082914159 0.72069125
## ptratio 0.13030590 0.172874684 -0.30276606
## black -0.13925272 -0.003068441 0.12665546
## lstat 0.21453204 -0.235803120 0.33057621
## medv 0.20055607 -0.346520102 -0.23336245
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9521 0.0362 0.0117
The LDA model calculates, for a new observation, the probability of belonging to each of the classes, and then classifies the observation to the most probable class.
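These class probabilities can be inspected directly through the posterior element returned by predict(). A small sketch, assuming the lda.fit and test objects created above:

```r
# A sketch for inspecting the class probabilities behind the classification.
# Assumes lda.fit and test exist as defined above.
lda.post <- predict(lda.fit, newdata = test)$posterior
# each row sums to one; the predicted class is the column with the largest probability
head(round(lda.post, 3))
```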
The most convenient and informative way to examine the results of an LDA is to visualize them with a biplot. Following the Datacamp exercise, this can be done with the following code:
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The arrows in this plot represent the relationships of the original variables and the LDA solution.
In this part we want to see how the LDA model performs when predicting on new data. To this end, we can use the predict() function. We start by predicting the crime classes with the test data.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 6 1 0
## med_low 5 19 4 0
## med_high 1 8 17 1
## high 0 0 0 26
From the cross tabulation of the results we can see that the model predicts the correct classes reasonably well. It has the most trouble separating med_low from low, but most of the predictions fall in the correct class.
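The overall prediction accuracy can also be computed directly by comparing the predicted classes to the correct ones. A one-line sketch, assuming the correct_classes and lda.pred objects from above; note that the exact value depends on the random train/test split:

```r
# A sketch for the share of correctly predicted classes.
# Assumes correct_classes and lda.pred exist as defined above.
mean(correct_classes == lda.pred$class)
```

With the table above this would be (14 + 19 + 17 + 26) / 102, roughly 0.75.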
Next we will take a few steps towards clustering the data. We start by reloading the Boston dataset and standardizing it again; this is necessary because we have made other changes to the dataset since scaling it.
# load the data
data("Boston")
# center and standardize variables
boston_scaled2 <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled2)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled)
Next we wish to calculate the distances between the observations. We do this by forming a standard Euclidian distance matrix:
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
## Warning in dist(boston_scaled2): NAs introduced by coercion
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5267 4.9081 4.9394 6.2421 13.0045
I ran into trouble here because there are NA values in the boston_scaled2 distance matrix: the k-means algorithm would not run on this dataset, returning an error about the NA values. In hindsight, the cause is that boston_scaled2 was re-created above from boston_scaled, which still contains the factor variable crime, and dist() coerces the factor levels to NA. For the rest of this exercise I will use the original Boston dataset (as in the Datacamp exercises). So, a new attempt:
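One way to diagnose this kind of coercion problem is to list the non-numeric columns of the data frame, since dist() silently converts its input to a numeric matrix and turns anything non-numeric into NA. A small sketch, assuming boston_scaled2 as created above (here the culprit should be the factor crime):

```r
# A sketch for diagnosing the coercion: list the non-numeric columns,
# which dist() would turn into NA values.
names(boston_scaled2)[!sapply(boston_scaled2, is.numeric)]
```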
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
From the summary we can see the key statistics of the calculated distances between the observations.
Then let’s move on to k-means clustering, which assigns observations to groups (clusters) based on their similarity. The following code runs the k-means algorithm on the dataset:
# k-means clustering
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
It’s difficult to see anything from this, so let’s zoom in a bit and look at only the last five columns:
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
Next we want to know the optimal number of clusters. We can find it by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. When you plot the number of clusters against the total WCSS, the optimal number of clusters is where the total WCSS drops radically.
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the plot above it’s evident that the optimal number of clusters is not three but two. Let’s run the k-means algorithm again with two clusters and then visualize the results:
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
Start by running the code below for the (scaled) train data that we used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# access the plotly library
library(plotly)
Then we will use the package plotly to create a 3D plot of the columns of the matrix product:
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
This is very cool indeed! But adding different colors for the different crime rate classes definitely makes it even cooler:
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
Lastly, we are supposed to define the color by the clusters of the k-means. Passing the whole km object to the color argument does not work (the colors disappear completely), because color expects one value per data point and km is a list. Note also that km was fitted on the full Boston data (506 rows) while matrix_product contains only the 404 training rows, so even km$cluster would not match. One working approach is to fit k-means on the train predictors and color the points by those cluster assignments:
# fit k-means on the train predictors and color the points by cluster
km_train <- kmeans(model_predictors, centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = as.factor(km_train$cluster))
You can find my preprocessed data set and the data wrangling R script from my GitHub repository.